Use of Native Xpath - Faster calculate when big data #906
base: master
Hi @mgogh, thanks for this PR. I arrived at a similar approach in a project I'm using for exploring/prototyping a variety of performance improvements, but hadn't gotten around to getting it into a PR. In my prototype, I arrived at a few additional optimizations.
I'll take some time to bring the pertinent prototype code in for a PR. In the meantime, would you be able to share a form like the one you described? I'd like to add it to my collection of performance-related forms.
@@ -1561,30 +1561,18 @@ FormModel.prototype.evaluate = function (
}

// try native to see if that works... (will not work if the expr contains custom OpenRosa functions)
if (
tryNative &&
Note, it is unfortunately not safe to always try native because in this ecosystem we do things that can be evaluated by a native evaluator but return an incorrect result for ODK XForms. I'm surprised no tests failed though (or did they?).
The main (and perhaps only) things are comparisons and arithmetic with date/dateTime strings.
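To illustrate the date case, here is a minimal sketch of why a native XPath 1.0 evaluator can accept such a comparison and still give the wrong answer. XPath 1.0 converts both operands of a relational operator to numbers, and a date string converts to NaN. The helpers xpathNumber, nativeLessThan and dateAwareLessThan are hypothetical names written for this example. They only approximate the respective behaviors and are not code from enketo-core or openrosa-xpath-evaluator.

```javascript
// Approximation of XPath 1.0 number() applied to a string: only plain decimal
// numerals (optional leading minus, surrounding whitespace) convert;
// anything else becomes NaN.
function xpathNumber(value) {
    const match = /^\s*(-?(?:\d+(?:\.\d*)?|\.\d+))\s*$/.exec(value);
    return match ? Number(match[1]) : NaN;
}

// How a native XPath 1.0 evaluator treats `a < b` for two strings:
// both sides are converted to numbers before comparing.
function nativeLessThan(a, b) {
    return xpathNumber(a) < xpathNumber(b);
}

// Hypothetical date-aware comparison, standing in for the result a form
// designer would expect. This is an assumption made for illustration only.
function dateAwareLessThan(a, b) {
    return Date.parse(a) < Date.parse(b);
}

console.log(nativeLessThan('2012-01-01', '2012-01-02')); // false (NaN < NaN)
console.log(dateAwareLessThan('2012-01-01', '2012-01-02')); // true
```

The native path does not throw here, so a "try native, fall back on error" strategy would never notice the mismatch.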
Oh, this is a shame. I see a few other cases called out in the spec. This feels like something we could maybe revisit with the tree-sitter-xpath grammar, which is fast enough that we'd still see significant performance improvements.
Yes, it's a huge shame. We've had discussions about a turbo mode that requires form designers to wrap such strings with date and date-time.
Yes, indeed, good find. There are some native XPath 1.0 functions that have deviating behavior, so some of those would also be an issue.
Curious about tree-sitter! Will it be able to find out if the value of /path/to/node is a date string?
Alone, the tree-sitter grammar won't be able to do any kind of static type analysis; that's not available in the AST. Here are the ways I'd imagine it could help for this case:
- Rule out expressions with functions we know deviate (although we can skip some because they'll fail to compile, which should perform similarly).
- Rule out expressions where we know an argument position is of date/date-time type.
- Relate nodeset expressions to their bindings, to identify their type and rule out expressions with those nodesets [1].
- Rule out expressions with literals which would be treated as dates.
If all of this sounds like it has overlap with openrosa-xpath-evaluator's parsing responsibilities... it does, heh. But it would probably be a good fit for this case, because tree-sitter is exceptionally fast.
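As a rough sketch of the first and last bullets above, the check below rules out expressions that call a function on a deny list or that contain a date-like string literal. It uses regular expressions instead of the tree-sitter grammar, so it is deliberately conservative: node tests such as text() are treated as function calls, and prefixes like jr: are ignored. The names mayUseNative and calledFunctions are hypothetical, and the deny list passed in the usage lines is a placeholder, not an authoritative list of deviating functions. The binding-type lookup from the third bullet is not covered.

```javascript
// Quoted literals that look like ISO dates or date-times, e.g. '2012-01-01'.
const DATE_LIKE_LITERAL = /(['"])\d{4}-\d{2}-\d{2}(?:T[^'"]*)?\1/;

// Collect every name that appears directly before "(" in the expression.
function calledFunctions(expr) {
    const names = [];
    const pattern = /([A-Za-z_][\w.-]*)\s*\(/g;
    let match;
    while ((match = pattern.exec(expr)) !== null) {
        names.push(match[1]);
    }
    return names;
}

// Returns false when the expression should NOT be sent to the native evaluator.
function mayUseNative(expr, deviatingFunctions) {
    if (DATE_LIKE_LITERAL.test(expr)) {
        return false;
    }
    return !calledFunctions(expr).some((name) => deviatingFunctions.includes(name));
}

// Placeholder deny list, assumed for the example only.
const denyList = ['round'];
console.log(mayUseNative('/data/a + 1', denyList)); // true
console.log(mayUseNative("/data/d < '2012-01-01'", denyList)); // false
console.log(mayUseNative('round(/data/a)', denyList)); // false
```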
Footnotes
1. This dovetails with other prototype work I've explored, identifying nodeset subexpressions to determine their dependencies. This already works for a huge set of expressions I pulled from openrosa-xpath-evaluator's tests; you can see the test fixtures used to validate that here. The grammar has proven pretty reliable so far. You can see example usage here and here to find nodeset sub-expressions. I have additional prototype work (currently only local) for resolving those sub-expressions to actual related nodes, which so far has worked for everything I've tried except with relative nested sub-expressions (e.g. in a predicate).
Cool stuff!
I have additional prototype work (currently only local) for resolving those sub-expressions to actual related nodes, which so far has worked for everything
I'm guessing the challenge might be to prevent this from becoming too costly (possibly negating the performance improvements achieved by sending to the native evaluator). It will be awesome if that is possible!
It’s more than possible, it’s a reality! I’ll push up instructions for running and measuring the subexpression logic when I get a chance, but for now I’ll just say that finding subexpressions in all of the cases I pulled from the current evaluator test suite averages 1ms or less even when my computer is throttling under heavy load. In my local stress testing, the native evaluator is also generally ~1ms, the extended evaluator is generally 10+ms.
Hi @eyelidlessness, here are the XLSForm and data: A smaller list to try (~1,300 rows):
Thank you @jdugh! I meant to reply earlier, but wound up on a yak shaving adventure trying to get the large CSV to load on my local enketo-express/ODK central setup. That aside, this is an awesome case to add to my growing collection of performance stress tests. Earlier today @lognaturel and I discussed a safer, more limited approach to this. Instead of always deferring to the native evaluator, or doing more complex analysis of queries, we'll likely start with a more naive analysis to optimize queries which are obviously straightforward (nodeset references, basic operators with non-ambiguous operands). This isn’t the end of the line for optimization potential I’m exploring, but it will be a big perf boost for a lot of common cases and a lot of the groundwork is already laid.
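One possible reading of that naive analysis, sketched as an allow list: accept a lone absolute nodeset reference, or a chain of basic operators whose operands are numeric literals or paths known to be numeric. This is only a guess at what "non-ambiguous" could mean, not the approach that was eventually implemented. The function isObviouslySimple and the isNumericNode callback (standing in for a bind-type lookup) are hypothetical. Operators must be surrounded by whitespace so that hyphenated node names are not split.

```javascript
const PATH = /^\/[A-Za-z_][\w.-]*(?:\/[A-Za-z_][\w.-]*)*$/;
const NUMBER = /^\d+(?:\.\d+)?$/;
const OPERATOR = /\s+(?:\+|-|\*|div|mod|!=|<=|>=|=|<|>)\s+/;

// isNumericNode(path) is an assumed callback that reports whether the node
// at `path` is bound to a numeric type.
function isObviouslySimple(expr, isNumericNode) {
    const trimmed = expr.trim();
    // A plain nodeset reference such as /data/a.
    if (PATH.test(trimmed)) {
        return true;
    }
    const operands = trimmed.split(OPERATOR);
    if (operands.length < 2) {
        return false; // no basic operator found: leave it to the full evaluator
    }
    // Every operand must be a number literal or a path known to be numeric.
    return operands.every(
        (operand) => NUMBER.test(operand) || (PATH.test(operand) && isNumericNode(operand))
    );
}

const numericPaths = new Set(['/data/a']);
const isNumeric = (path) => numericPaths.has(path);
console.log(isObviouslySimple('/data/a', isNumeric)); // true
console.log(isObviouslySimple('/data/a + 1', isNumeric)); // true
console.log(isObviouslySimple('/data/d < /data/a', isNumeric)); // false
```

Anything that fails the check simply takes the existing, slower path, so a false negative costs time but not correctness.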
If you create a form with select_one_from_file based on a big CSV file (more than 3,000 rows and 10 columns) and then add ten calculates, one for each column, it will be very slow. (The select_one was filtered by another question.)
Example: In France, we have 34,000 communes, which can be filtered by region and department.
Always trying the native XPath approach first speeds up value lookups when the model is big.
If expr contains custom OpenRosa functions, it will fall back to the fork as expected (jsEvaluate), which is slower than the native method.
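The behavior described here can be sketched roughly as follows. This is an illustration and not the code in the diff. The function evaluateWithFallback and both stub evaluators are hypothetical; in a browser the native evaluator would be built on document.evaluate, which is replaced by a stub here so the example runs without a DOM.

```javascript
// Try the fast native evaluator first; if it throws (for example because the
// expression uses a function it does not know), fall back to the slower one.
function evaluateWithFallback(expr, nativeEvaluate, jsEvaluate) {
    try {
        return { result: nativeEvaluate(expr), evaluator: 'native' };
    } catch (error) {
        return { result: jsEvaluate(expr), evaluator: 'js' };
    }
}

// Stub: pretends to be a native evaluator that rejects the OpenRosa
// function selected().
function stubNative(expr) {
    if (expr.includes('selected(')) {
        throw new Error('unknown function');
    }
    return 'native result';
}

// Stub: pretends to be the extended OpenRosa evaluator.
function stubJs() {
    return 'js result';
}

console.log(evaluateWithFallback('/data/a', stubNative, stubJs).evaluator); // native
console.log(evaluateWithFallback("selected(/data/a, 'x')", stubNative, stubJs).evaluator); // js
```

As noted in the review, this fallback only triggers when the native evaluator throws. An expression that the native evaluator accepts but computes differently from ODK XForms, such as a date comparison, would pass through unnoticed.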